This training module was developed by Alexis Payton, Elise Hickman, and Julia E. Rager.
All input files (script, data, and figures) can be downloaded from the UNC-SRP TAME2 GitHub website.
This module is intended to be a starting guide to cleaning and organizing an example toxicology dataset in Excel. Data wrangling involves the cleaning, removal of erroneous data, and restructuring needed to prepare wet lab-generated data for downstream analyses. These steps will ensure that:
Click here for more information on data wrangling.
In this training tutorial, we’ll make use of an example dataset that needs to be wrangled. The dataset contains concentration values for molecules that were measured using protein-based ELISA technologies. These molecules specifically span 17 sterols and cytokines, selected based upon their important roles in mediating biological responses. These measures were derived from human serum samples. Demographic information also exists for each subject.
The following steps detailed in this training module are by no means exhaustive! Further resources are provided at the end. This module provides example steps that are helpful when wrangling your data in Excel. Datasets often come in many different formats from our wet bench colleagues, therefore some steps will likely need to be added, removed, or amended depending on your specific data.
Open Microsoft Excel and prior to ANY edits, click “File” –> “Save As” to save a new version of the file that can serve as the cleaned version of the data. This is very important for file tracking purposes, and can help in the instance that the original version needs to be referred back to (e.g., if data are accidentally deleted or modified during downstream steps).
Let’s first view what the dataset currently looks like:
The following keyboard shortcuts can help you work more efficiently in Excel:
Note: This only works if there are no cells with missing information or gaps in the columns/rows used to define the peripheral area.
For more available shortcuts on various operating systems click here.
Before we can begin organizing the data, we need to remove the entirely blank rows of cells. This reduces the file size and allows for the use of the filter function in Excel, as well as other organizing functions, which will be used in the next few steps. This step also makes the data look more tidy and amenable to import for coding purposes.
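For readers who prefer scripting, the same cleanup can be sketched in Python. This is an illustrative stand-in for the Excel step, assuming the sheet has been read into a list of rows (the column names and values shown are hypothetical):

```python
def drop_blank_rows(rows):
    """Keep only rows that contain at least one non-empty cell."""
    return [row for row in rows if any(cell.strip() for cell in row)]

rows = [
    ["Subject_No", "Group", "Cortisol"],
    ["", "", ""],                      # an entirely blank row
    ["1", "Non-Smoker", "10.2"],
]
cleaned = drop_blank_rows(rows)  # the blank row is removed
print(cleaned)
```

As in Excel, removing fully blank rows keeps filtering and sorting functions from silently stopping at the first gap.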
After removing the blank rows, the file should look like the
screenshot below.
There are many ways missing data can be encoded in datasets, including values like “blank”, “N/A”, or “NA”, or simply leaving a cell empty. Replacing all missing values with “NA” is done for two reasons:
To check for missing values, apply the filter function to each column and select only the cells with missing values. (You may need to scroll to the bottom of the filter pop-up window for numerical data.) Enter “NA” into the first cell of the filtered column, then double-click the bottom right corner of that cell to copy “NA” down the rest of the column.
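If your data did contain missing values, the recoding can be sketched in Python as a simple token match. The set of missing-value tokens below is an assumption; adjust it to whatever encodings appear in your own dataset:

```python
# Hypothetical set of tokens that signal a missing value
MISSING_TOKENS = {"", "blank", "N/A", "n/a", "NA", "na"}

def recode_missing(rows):
    """Replace any cell matching a known missing-value token with 'NA'."""
    return [["NA" if cell.strip() in MISSING_TOKENS else cell for cell in row]
            for row in rows]

print(recode_missing([["1", "", "N/A", "10.2"]]))  # -> [['1', 'NA', 'NA', '10.2']]
```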
There were no missing data in this dataset, so this step can be skipped.
Metadata explains what each column represents in the dataset. Metadata is now a required component of data sharing, so it is best to initiate this process prior to data analysis. Ideally, this information is filled in by the scientist(s) who generated the data.
Categorical data are easier to handle in programming languages when they are capitalized and abbreviated. Abbreviation also helps reduce typos and other typing mistakes within your script.
For this dataset, the following variables were edited:
Excel Trick: To change cells that contain the same data simultaneously, navigate to “Edit”, click “Find”, and then “Replace”.
Once the categorical data have been abbreviated, add those
abbreviations to the metadata and describe what they symbolize.
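In a script, Excel's Find/Replace step corresponds to a lookup table. A minimal Python sketch is below; the specific labels and abbreviations are hypothetical and should be replaced with the ones recorded in your metadata:

```python
# Hypothetical mapping; actual labels depend on your dataset and metadata
ABBREVIATIONS = {
    "Non-Smoker": "NS",
    "Cigarette Smoker": "CS",
}

def abbreviate(column):
    """Replace each known label with its abbreviation; leave others as-is."""
    return [ABBREVIATIONS.get(value, value) for value in column]

print(abbreviate(["Non-Smoker", "Cigarette Smoker", "Non-Smoker"]))
```

Keeping the mapping in one place, like the dictionary above, mirrors the advice to document every abbreviation in the metadata.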
For this dataset, we will sort by the column “Group”. This organizes the data and sets it up for the next step.
Analysis-specific subject numbers are created to give each subject an ordinal number, which allows the scientist to easily identify the total number of subjects. In addition, these new ordinal subject numbers will be used to create a subject identifier combining each subject’s group and subject number, which is helpful for downstream visualization analyses.
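The sorting and numbering steps can be sketched in Python as follows. This assumes numbering restarts within each group, so the combined identifier reads like “NS_1”, “NS_2”, “CS_1”, and so on; the group labels here are illustrative:

```python
def add_subject_ids(records):
    """Sort by group, assign an ordinal subject number within each group,
    and build a combined Group_Subject_No identifier (e.g., 'NS_1')."""
    counts = {}
    out = []
    for rec in sorted(records, key=lambda r: r["Group"]):
        counts[rec["Group"]] = counts.get(rec["Group"], 0) + 1
        n = counts[rec["Group"]]
        out.append({**rec,
                    "Subject_No": n,
                    "Group_Subject_No": f"{rec['Group']}_{n}"})
    return out

subjects = add_subject_ids([{"Group": "NS"}, {"Group": "CS"}, {"Group": "NS"}])
print([s["Group_Subject_No"] for s in subjects])  # -> ['CS_1', 'NS_1', 'NS_2']
```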
R converts spaces in column names to periods when reading in data, so it is common practice to replace spaces with underscores when doing data analysis in R. Avoid using dashes in column names or anywhere else in the dataset.
Programming languages, in general, do not operate well with special symbols and dashes, particularly when included in column identifiers. For this reason, it is best to remove these while cleaning up your data, prior to importing it into R or your preferred programming software.
In this case, this dataset contains dashes and Greek letters within some of the column header identifiers. Here, it is beneficial to remove the dashes (e.g., change IL-10 to IL10) and replace the Greek letters with the first letter of their English names (e.g., change TNF-\(\alpha\) to TNFa).
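This header cleanup can be expressed as a small Python function. The Greek-letter mapping below covers only a few common cases and is an assumption; extend it to match the symbols in your own headers:

```python
# Assumed mapping from Greek letters to their first English letter; extend as needed
GREEK = {"α": "a", "β": "b", "γ": "g"}

def clean_name(name):
    """Substitute Greek letters, replace spaces with underscores,
    and drop dashes from a column header."""
    for greek, latin in GREEK.items():
        name = name.replace(greek, latin)
    return name.replace(" ", "_").replace("-", "")

print(clean_name("TNF-α"))  # -> TNFa
print(clean_name("IL-10"))  # -> IL10
```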
These data will likely be shared with collaborators, uploaded onto data deposition websites, and used as supporting information in published manuscripts. For these purposes, it is nice to format data in Excel such that it is visually appealing and easy to digest.
For example, here, it is nice to bold column identifiers and center
the data, as shown below:
The subject identifier column, labeled “Group_Subject_No”, combines the subject number with the variable of interest (i.e., Group for this dataset). This is useful in analyses for identifying outliers by both subject number and group.
This example dataset is very small, so the demographic data (e.g., sex, race, age) were kept within the same file as the experimentally measured molecules. However, in larger datasets (e.g., genome-wide data, exposomic data, etc.), it is often beneficial to separate the demographic data into their own file, labeled according to a format such as “XXX_Subject_Info_061622” (e.g., “Allostatic_Subject_Info_061622”).
This step was not completed for the current dataset, given its small size and the simplicity of the downstream analyses.
In a wide format, the subject identifier column does NOT repeat. For this dataset, each subject has one row containing all of its data, so the subject identifier occurs only once in the dataset.
Wide Format
In a long format, the subject identifier column DOES repeat. For this dataset, that means a new column entitled “Variable” was created containing all the mediator names, alongside a column entitled “Value” containing their corresponding values. In the screenshot, an additional column, “Category”, was added to help with the categorization of mediators in R analyses.
Long Format
A long format is often preferred because it makes visualizations and statistical analyses more efficient in R. In the long format, we were able to add a column entitled “Category” that classifies each mediator as an “AL Biomarker” or “Cytokine”, allowing us to more easily subset the mediators in R. Read more about wide and long formats here.
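The wide-to-long conversion (the “unpivot” performed by Power Query in the steps that follow) can be sketched in plain Python. Column names here are illustrative:

```python
def wide_to_long(rows, id_cols):
    """Unpivot a wide table: keep the identifier columns and emit one
    (Variable, Value) row per measurement per subject."""
    header = rows[0]
    id_idx = [header.index(c) for c in id_cols]
    var_idx = [i for i in range(len(header)) if i not in id_idx]
    long_rows = [[*id_cols, "Variable", "Value"]]
    for row in rows[1:]:
        for i in var_idx:
            long_rows.append([row[j] for j in id_idx] + [header[i], row[i]])
    return long_rows

wide = [["Group_Subject_No", "Cortisol", "CRP"],
        ["NS_1", "10.2", "1.1"]]
for row in wide_to_long(wide, ["Group_Subject_No"]):
    print(row)
```

Each subject's single wide row becomes one long row per mediator, which is exactly the structure the “Variable”/“Value” columns provide.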
To convert the data from a wide to long format, follow the steps below:
To do this, we will use a Power Query in Excel. Note: If you are working on a Mac, you will need Excel 2016 or later to follow this tutorial, as Power Query is not available in earlier versions. Add-ins are available for Windows users. See this link for more details.
Start by copying all of the data, including the column titles. (Hint: Try using the keyboard shortcut mentioned above.)
Click the tab at the top that says “Data”. Then click “Get Data (Power Query)” at the far left.
It will ask you to choose a data source. Click “Blank table” in the bottom row.
Paste the data into the table. (Hint: Use the shortcut Ctrl + V.) At this point, your screen should look like the screenshot below.
Click “Use first row as headers” and then click “Next” in the bottom right hand corner.
Select all the columns with biomarker names. That should be the
column “Cortisol” through the end.
Click the “Transform” button in the upper left hand corner. Then click “Unpivot columns” in the middle of the pane. The final result should look like the screenshot below, with all the biomarkers now in one column entitled “Attribute” and their corresponding values in another column entitled “Value”.
To save this, go back to the “Home” tab and click “Close &
load”. You should see something similar to the screenshot below.
In the table style gallery in the upper right (within the “Table” tab), click the arrow to scroll past the green shaded styles until you see one with no shading, then select that unshaded style.
Click “Convert to Range” within the “Table” tab. This removes the Power Query capabilities, so that the data become a regular Excel sheet.
Now the “Category” column can be created to identify the types of biomarkers in the dataset. The allostatic load (AL) biomarkers denoted in the “Category” column include the variables Cortisol, CRP, Fibrinogen, Hba1c, HDL, and Noradrenaline. The rest of the variables were labeled as cytokines. Additionally, we can make this data more closely resemble the final long format screenshot by bolding the headers, centering all the data, etc.
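The “Category” assignment is a simple membership test, sketched below in Python using the allostatic load biomarkers named above:

```python
# AL biomarkers as listed in the module; all other variables are cytokines
AL_BIOMARKERS = {"Cortisol", "CRP", "Fibrinogen", "Hba1c", "HDL", "Noradrenaline"}

def categorize(variable):
    """Label a mediator as an AL biomarker or a cytokine."""
    return "AL Biomarker" if variable in AL_BIOMARKERS else "Cytokine"

print(categorize("Cortisol"))  # -> AL Biomarker
print(categorize("IL10"))      # -> Cytokine
```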
We have successfully wrangled our data and the final dataset now
looks like this:
A PivotTable is a tool in Excel used to summarize numerical data. It is called a pivot table because it pivots, or changes, how the data are displayed to make statistical inferences. This can be useful for generating initial summary-level statistics to gauge the distribution of the data.
To create a PivotTable, start by selecting all of the data. (Hint:
Try using the keyboard shortcut mentioned above.) Click “Insert” tab on
the upper left-hand side, click “PivotTable”, and click “OK”. The new
PivotTable should be available in a new sheet as seen in the screenshot
below.
A PivotTable will be constructed based on the column headers that can
be dragged into the PivotTable fields located on the right-hand side.
For example, what if we were interested in determining if there were
differences in average expression between non-smokers and cigarette
smokers in each category of biomarkers? As seen below, drag the “Group”
variable under the “Rows” field and drag the “Value” variable under the
“Values” field.
Notice that it automatically calculates the sum of the expression
values for each group. To change the function to average, click the “i”
icon and select “Average”. The output should mirror what’s below with
non-smokers having an average expression that’s more than double that of
cigarette smokers.
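The same group-wise average that the PivotTable produces can be computed with a few lines of Python. The group labels and values below are illustrative, not taken from the module's dataset:

```python
def average_by_group(records):
    """Mimic the PivotTable: mean of Value within each Group."""
    sums, counts = {}, {}
    for rec in records:
        g = rec["Group"]
        sums[g] = sums.get(g, 0.0) + rec["Value"]
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}

example = [
    {"Group": "NS", "Value": 10.0},
    {"Group": "NS", "Value": 30.0},
    {"Group": "CS", "Value": 8.0},
]
print(average_by_group(example))  # -> {'NS': 20.0, 'CS': 8.0}
```

Note that, like Excel's default sum, the choice of aggregation function matters; here the mean is computed explicitly rather than switching a setting.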
For the most part, it’s better to perform final analyses in R (or another programming language) rather than Excel for the following reasons…
However, Excel is still a software that has many benefits for running analyses including…
Depending on each scientist’s skill-level and the complexity of the analysis, Excel or R could be beneficial.
In summary, this training module highlights the importance of data wrangling and how to do so in Microsoft Excel for downstream analyses. Concepts discussed include helpful Excel features like power queries and pivot tables and when to use Microsoft Excel vs. R.
Data wrangling in Excel can be expedited with knowledge of useful features and functions to format data. Check out the resources below for additional information on Excel tricks.